Google Data Analytics Professional Capstone Project - BellaBeat

Bellabeat is a high-tech manufacturer of health-focused products for women. In order to answer the key business questions, I followed the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

I acted as a junior data analyst working on the marketing analyst team at Bellabeat. The Chief Creative Officer (CCO) of Bellabeat believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. As part of the data analytics team, I was asked to focus on one of Bellabeat’s products and analyze smart device usage data to gain insight into how consumers are already using their smart devices. The insights I discover will then help guide the company’s marketing strategy. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

Step 1: Ask

In my hypothetical role as the junior data analyst for Bellabeat’s marketing team, the CCO asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation. These questions guided my analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

As part of the ask phase of the data analysis, I identified the key task of the analysis and the stakeholders to whom I would make recommendations based on the insights derived from the data.

Key tasks

1. Identify the business task: The main business task is to use smart device data to analyse how users use smart devices to monitor their activity.

2. Consider key stakeholders: The main stakeholders of the analysis are the executive team, which comprises the CCO and the co-founders; the customer-facing marketing team; and the analytics team carrying out the analysis.

Deliverable:  A clear statement of the business task

The main deliverable of the Ask Phase is a clear business task which I documented as below:

Bellabeat would like to understand how users use lifestyle smart devices. To do so, we will analyse data collected from smart device users who do not use Bellabeat products and identify trends in how they use their devices. This will help shape Bellabeat’s marketing strategy so that it can advertise its products to more users.

Step 2: Prepare

The CCO encourages the use of public data that explores smart device users’ daily habits. She points us to a specific data set:

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

The CCO tells us that this data set might have some limitations, and encourages us to consider adding another data set to help address those limitations as we begin to work more with this data.

The guiding questions in the preparation of the data are as below:

Where is your data stored?

The data is stored on Kaggle as part of the FitBit Fitness Tracker dataset made available through Mobius. The data set contains personal fitness tracker data from thirty Fitbit users who consented to the collection of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

How is the data organized? Is it in long or wide format?

The data is stored in CSV files, which are organised into minute, hourly and daily measurements of calories burned, steps, sleep and activity intensity.

Are there issues with bias or credibility in this data? Does your data ROCCC?

To assess the credibility of the data, I followed Google’s ROCCC criteria. For each of the datasets I determined whether the data is:

  1. Reliable: The dataset is reliable for the analysis as it contains fitness data relating to sleep, steps taken, calories burned and heart rate. The data helps us gain insight into how users use their Fitbits.
  2. Original: The data was collected by fitness tracker users.
  3. Comprehensive: The data is organised into minute, hourly and daily measurements, which makes it easy to understand.
  4. Current: The dataset covers user activity over two months in 2016, which I treated as current for this analysis.
  5. Cited: The data is not cited; it is, however, available on Kaggle.

How are you addressing licensing, privacy, security, and accessibility?

The data was anonymised with no PII included in the data. The data was classified using a unique ID that was used to link the different datasets.

How did you verify the data’s integrity?

The data can be said to have integrity as it is publicly available on Kaggle, a public platform for data. To check the integrity of the datasets, I confirmed their entity and referential integrity. The data had unique Ids that corresponded to the users who submitted the data, and these Ids were consistent across all the datasets.
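The entity and referential integrity checks described above could be sketched in base R as follows. This is a minimal illustration on toy data frames with made-up Ids, not the checks actually run for this case study; the object names `activity` and `sleep` are assumptions.

```r
# Toy stand-ins for two of the CSV files (hypothetical values, not the real data).
activity <- data.frame(
  Id = c(1001, 1001, 1002, 1003),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016")
)
sleep <- data.frame(
  Id = c(1001, 1002, 1004),
  SleepDay = rep("4/12/2016 12:00:00 AM", 3)
)

# Entity integrity: each (Id, date) pair should appear only once in a table.
duplicate_keys <- sum(duplicated(activity[, c("Id", "ActivityDate")]))

# Referential integrity: every Id in the sleep table should also exist
# in the activity table. Any Id listed here has no matching activity rows.
orphan_ids <- setdiff(unique(sleep$Id), unique(activity$Id))

print(duplicate_keys)
print(orphan_ids)
```

In this toy example the Id 1004 appears only in the sleep table, so it would be flagged for review before the datasets are joined.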

How does it help you answer your question?

The dataset contains information collected from 30 Fitbit users over two months. This data will help me identify trends in the use of Fitbit trackers and so answer the questions in the business task.

Are there any problems with the data?

The data has date columns that are stored as strings. In order to properly analyse the data, I converted all date columns to a datetime format in R using the as.POSIXct function.
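As a small illustration of this conversion, the sketch below parses date strings in the two layouts that appear in the cleaning code (a date-only layout and a 12-hour timestamp). The sample strings and the fixed `"UTC"` time zone are assumptions chosen to make the example reproducible; the analysis itself uses `Sys.timezone()`.

```r
# Sample strings in the two layouts used by the daily and hourly files
# (illustrative values only).
raw_day  <- c("4/12/2016", "4/13/2016")
raw_hour <- c("4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 PM")

# %m/%d/%Y reads month/day/year; %I with %p reads a 12-hour clock with AM/PM.
# Note: %p matching depends on the locale, so an English or C locale is assumed.
day_parsed  <- as.POSIXct(raw_day,  format = "%m/%d/%Y", tz = "UTC")
hour_parsed <- as.POSIXct(raw_hour, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# A failed parse yields NA, so counting NAs is a quick sanity check.
failed_parses <- sum(is.na(day_parsed)) + sum(is.na(hour_parsed))

# Extract a 24-hour time-of-day label, as done for the hourly plots.
time_label <- format(hour_parsed, "%H:%M:%S")
print(time_label)
```

Checking for `NA` values after parsing is worthwhile because a format string that does not match the input fails silently rather than raising an error.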

Step 3: Process

The process phase is to ensure the data is complete and can be used in the analyze phase. In order to process the data, I used the following guiding questions:

Guiding questions

What tools are you choosing and why?

R: I am using R due to its capabilities in cleaning data, joining datasets and visualizing data to get key insights that solve the business task.

Have you ensured your data’s integrity?

I ensured that the data had entity and referential integrity by confirming all user Ids were consistent in all datasets.

What steps have you taken to ensure that your data is clean?

To ensure my data is clean, I followed the below steps:

1) Removed duplicate rows

2) Removed rows that have null values

3) Ensured the datetime columns are properly formatted.

How can you verify that your data is clean and ready to analyze?

To verify that the data was ready to use, I analysed each dataset’s summary statistics in R. The summary shows whether there are any nulls in the data and reports the min, max, mean and median of the numerical variables, giving a quick overview of the data.
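Alongside `summary()`, explicit counts of missing values and duplicate rows make the check easy to repeat before and after cleaning. The sketch below is one possible way to do this; `check_clean` is a hypothetical helper and the toy data frame is made up for illustration.

```r
# Hypothetical helper: report row count, missing values per column,
# and the number of fully duplicated rows.
check_clean <- function(df) {
  list(
    rows = nrow(df),
    missing_per_column = colSums(is.na(df)),
    duplicate_rows = sum(duplicated(df))
  )
}

# Toy data with one duplicated row and one missing value (not the real data).
toy <- data.frame(Id = c(1, 1, 2, 2), TotalSteps = c(5000, 5000, NA, 7200))

before <- check_clean(toy)

# Drop rows with missing values, then drop exact duplicate rows.
cleaned <- unique(na.omit(toy))

after <- check_clean(cleaned)
print(before)
print(after)
```

Comparing `before` and `after` shows how many rows each cleaning step removed, which is useful to record in the cleaning documentation.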

Have you documented your cleaning process so you can review and share those results?

The clean-up was documented in an R Markdown file, which contains the code and the output of the clean-up. This ensures that anyone can access the data and the code used for the clean-up, and can replicate the process that was followed to clean the data. It also supports the credibility of the data by enabling review of the work done.

The following is the code used in loading and cleaning the data:

1) Load libraries needed to clean the data.

#Libraries used
library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library(plotly)
library(hrbrthemes)
Warning: package ‘hrbrthemes’ was built under R version 4.1.3
Registering Windows fonts with R
NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
      Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
      if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

2) Prepare the data

Load the data used for analysis. The data is classified as daily and hourly.

#Daily Data
daily_calories<-read.csv("Bellabeat/dailyCalories_merged.csv")
daily_activity<-read.csv("Bellabeat/dailyActivity_merged.csv")
daily_steps<-read.csv("Bellabeat/dailySteps_merged.csv")
daily_intensities<-read.csv("Bellabeat/dailyIntensities_merged.csv")
dailysleep<-read.csv("Bellabeat/sleepDay_merged.csv")
#Hourly data
hourly_calories<-read.csv("Bellabeat/hourlyCalories_merged.csv")
hourly_steps<-read.csv("Bellabeat/hourlySteps_merged.csv")
hourly_intensities<-read.csv("Bellabeat/hourlyIntensities_merged.csv")

3) Clean the data

#Daily Activity Data
#1)Remove rows with nulls
daily_activity<-na.omit(daily_activity)
#2)Change date format
daily_activity$ActivityDate=as.POSIXct(daily_activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
#3)Remove duplicates
#distinct() with no columns drops fully duplicated rows; the result must be assigned back
daily_activity<-daily_activity%>%distinct()
print(head(daily_activity))

Get a summary of the data

#Get a summary of the data
summary(daily_activity)
       Id            ActivityDate         TotalSteps    TotalDistance    TrackerDistance 
 Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 2.620  
 Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245   Median : 5.245  
 Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490   Mean   : 5.475  
 3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 7.710  
 Max.   :8.878e+09                      Max.   :36019   Max.   :28.030   Max.   :28.030  
 LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
 Min.   :0.0000           Min.   : 0.000     Min.   :0.0000           Min.   : 0.000     
 1st Qu.:0.0000           1st Qu.: 0.000     1st Qu.:0.0000           1st Qu.: 1.945     
 Median :0.0000           Median : 0.210     Median :0.2400           Median : 3.365     
 Mean   :0.1082           Mean   : 1.503     Mean   :0.5675           Mean   : 3.341     
 3rd Qu.:0.0000           3rd Qu.: 2.053     3rd Qu.:0.8000           3rd Qu.: 4.782     
 Max.   :4.9421           Max.   :21.920     Max.   :6.4800           Max.   :10.710     
 SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
 Min.   :0.000000        Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
 1st Qu.:0.000000        1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
 Median :0.000000        Median :  4.00    Median :  6.00      Median :199.0       
 Mean   :0.001606        Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
 3rd Qu.:0.000000        3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
 Max.   :0.110000        Max.   :210.00    Max.   :143.00      Max.   :518.0       
 SedentaryMinutes    Calories   
 Min.   :   0.0   Min.   :   0  
 1st Qu.: 729.8   1st Qu.:1828  
 Median :1057.5   Median :2134  
 Mean   : 991.2   Mean   :2304  
 3rd Qu.:1229.5   3rd Qu.:2793  
 Max.   :1440.0   Max.   :4900  
#Daily Sleep
#Get the description of the dataframe.
str(dailysleep)
'data.frame':   413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : POSIXct, format: "2016-04-12" "2016-04-13" "2016-04-15" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
#1)Remove rows with nulls
dailysleep<-na.omit(dailysleep)
#2)Remove duplicates
#distinct() with no columns drops fully duplicated rows; the result must be assigned back
dailysleep<-dailysleep%>%distinct()
#3)Clean date format
dailysleep[['SleepDay']] <- as.POSIXct(dailysleep$SleepDay, format="%m/%d/%Y", tz=Sys.timezone())
#Rename
names(dailysleep)[names(dailysleep) == 'SleepDay'] <- "ActivityDate"
print(head(dailysleep))
#Hourly calories
str(hourly_calories)
'data.frame':   22099 obs. of  4 variables:
 $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" "2016-04-12 02:00:00" ...
 $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...
 $ Time        : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
hourly_calories$ActivityHour=as.POSIXct(hourly_calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_calories$Time<-format(hourly_calories$ActivityHour, format = "%H:%M:%S")
print(head(hourly_calories))
#Hourly Steps
str(hourly_steps)
'data.frame':   22099 obs. of  4 variables:
 $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" "2016-04-12 02:00:00" ...
 $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...
 $ Time        : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_steps$Time<-format(hourly_steps$ActivityHour, format = "%H:%M:%S")
print(head(hourly_steps))
#Merge the daily activity dataset with the sleep dataset
Daily_data<-inner_join(daily_activity, dailysleep, by=c("Id","ActivityDate"))
Daily_data$month <-months(Daily_data$ActivityDate)
Daily_data$weekday <-weekdays(Daily_data$ActivityDate)
print(head(Daily_data))

Step 4 : Analyse and Share

In order to visualize the data, I first summarised the datasets according to the question being asked. I subset the datasets to gain more insights from the data and identify trends that can be used to solve the business task.

Insight 1: How many calories are users burning per hour?

This will help us get insights on when the users are most active during the day.

#Bar chart of calories burnt per hour
#Libraries
library(ggplot2)
library(hrbrthemes)
#Preparation of dataframe
hourly_cal_new <- hourly_calories %>%
  group_by(Time) %>%
  drop_na() %>%
  summarise(total_hourly_calories = sum(Calories))

#BarPlot
#geom_col() draws bars whose heights are the pre-computed totals
ggplot(data=hourly_cal_new, aes(x=Time, y=total_hourly_calories)) + geom_col(fill='darkblue') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Total Calories vs. Time")

From the visualization, we can note that users are active from 5 am to 11 pm. The peak of activity is from 4 pm to 7 pm.
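The peak can also be read off numerically rather than from the chart alone. The sketch below uses base R on a toy data frame with made-up calorie values; only the column names `Time` and `Calories` mirror the hourly dataset.

```r
# Toy hourly records for two users (illustrative values, not the real data).
hourly <- data.frame(
  Time = c("08:00:00", "17:00:00", "18:00:00", "08:00:00", "17:00:00", "18:00:00"),
  Calories = c(90, 140, 150, 100, 160, 130)
)

# Total calories per hour of day, sorted from highest to lowest.
totals <- aggregate(Calories ~ Time, data = hourly, FUN = sum)
totals <- totals[order(-totals$Calories), ]

# The first row after sorting is the hour with the highest total.
peak_hour <- as.character(totals$Time[1])
print(totals)
print(peak_hour)
```

Sorting the summarised table this way gives a ranked list of hours, which can back up the time window quoted from the bar chart.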

Insight 2: How many steps are taken per hour?

This will help us get insights on when the users are most active during the day and when they are out and about getting exercise.

#Line plot of avg steps per hour
# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)

hourly_steps_new <- hourly_steps %>%
  group_by(Time) %>%
  drop_na() %>%
  summarise(mean_hourly_steps = mean(StepTotal))
#print(hourly_steps_new)

# Plot
# Time is a character column, so group=1 tells geom_line() to connect all points as one series
hourly_steps_new %>%
  ggplot( aes(x=Time, y=mean_hourly_steps, group=1)) +
  geom_line( color="grey") +
  geom_point(shape=21, color="black", fill="#69b3a2", size=2) +
  theme_ipsum() +
  ggtitle("Average Steps per hour")
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y,  :
  font family not found in Windows font database

From the visualization, we note that users are active in recording their steps from 6 am to 7 pm. The peak of activity is between 3 pm and 6 pm.

Insight 3: How long are people winding down in order to get to sleep?

This will help analyse the sleep patterns of the users.

#Lollipop graph - total time in bed vs time asleep
weekday_sleep <- Daily_data %>%
  group_by(weekday) %>%
  drop_na() %>%
  summarise(total_min_asleep = sum(TotalMinutesAsleep),
            total_min_in_bed = sum(TotalTimeInBed))
print(weekday_sleep)

# Plot
ggplot(weekday_sleep) +
  geom_segment( aes(x=weekday, xend=weekday, y=total_min_asleep, yend=total_min_in_bed), color="grey") +
  geom_point( aes(x=weekday, y=total_min_asleep), color=rgb(0.2,0.7,0.1,0.5), size=3 ) +
  geom_point( aes(x=weekday, y=total_min_in_bed), color=rgb(0.7,0.2,0.1,0.5), size=3 ) +
  coord_flip()+
  theme_ipsum() +
  theme(
    legend.position = "left",
  ) +
  xlab("Weekday") +
  ylab("Total Minutes")+
  ggtitle("Time in Bed vs Time Asleep per weekday")
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)) :
  font family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y,  :
  font family not found in Windows font database

From the lollipop graph, we can see that users sleep less on Mondays and take longer to fall asleep on Sundays. This pattern is consistent with what is often called Sunday insomnia.
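Because weekday totals depend on how many records each weekday happens to have, an average of the gap between time in bed and time asleep may be a fairer comparison. The sketch below shows that calculation in base R on made-up values; the column `minutes_awake_in_bed` is a name introduced here for illustration and is not part of the original analysis.

```r
# Toy sleep records (illustrative values, not the real data).
sleep <- data.frame(
  weekday = c("Sunday", "Sunday", "Monday", "Monday"),
  TotalMinutesAsleep = c(400, 420, 380, 390),
  TotalTimeInBed = c(460, 470, 410, 415)
)

# Minutes spent in bed but not asleep for each record.
sleep$minutes_awake_in_bed <- sleep$TotalTimeInBed - sleep$TotalMinutesAsleep

# Average per weekday, so weekdays with more records are not inflated.
avg_awake <- aggregate(minutes_awake_in_bed ~ weekday, data = sleep, FUN = mean)
print(avg_awake)
```

With these toy numbers Sunday averages 55 minutes awake in bed and Monday 27.5, which is the kind of per-day comparison the lollipop chart is meant to convey.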

Insight 4: How can we classify users of Fitbit applications?

This will show what percentage of users rely on Fitbit apps and devices to track their activity during the day.

#users
daily_average <- Daily_data %>%
  group_by(Id) %>%
  summarise (mean_daily_steps = mean(TotalSteps),
             mean_daily_calories = mean(Calories),
             mean_distance = mean(TotalDistance),
             mean_daily_sleep = mean(TotalMinutesAsleep))

head(daily_average)
#Bands are half-open (lower bound included, upper bound excluded) so no value falls between them
user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "Sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "Lightly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "Moderately active", 
    mean_daily_steps >= 10000 ~ "Very active"
  ))

head(user_type)
#pie chart
library(plotly)
user_types<-c("Sedentary users", "Lightly active users", "Moderately active users", "Very active users")
sedentary<-nrow(user_type[user_type$user_type == "Sedentary",])
light<-nrow(user_type[user_type$user_type == "Lightly active",])
moderate<-nrow(user_type[user_type$user_type == "Moderately active",])
active<-nrow(user_type[user_type$user_type == "Very active",])
count_users<-c(sedentary, light, moderate, active)
user_types_df<-data.frame(user_types,count_users)

fig <- plot_ly(user_types_df, labels = ~user_types, values = ~count_users, type = 'pie')
fig <- fig %>% layout(title = 'Classification of Users based on mean daily steps',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig

From the visualization, we can see that the majority of users are moderately active. This suggests that users make moderate use of Fitbit applications to monitor their activity during the day.
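The same step bands can also be assigned with base R's `cut()`, which makes the band edges explicit and gives the percentage share of each group directly. This is an alternative sketch on made-up step averages, not the code used to produce the pie chart.

```r
# Toy mean daily step counts for six users (illustrative values only).
mean_daily_steps <- c(3200, 6100, 7499.5, 8800, 9100, 12000)

# right = FALSE makes each band include its lower edge and exclude its upper
# edge, so a value such as 7499.5 still lands in a band instead of becoming NA.
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Lightly active", "Moderately active", "Very active"),
  right = FALSE
)

# Percentage of users in each band, rounded to one decimal place.
share <- round(100 * prop.table(table(user_type)), 1)
print(share)
```

Reporting the shares as percentages alongside the chart makes it easier to state which group is largest and by how much.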

Insight 5: Are there correlations between total calories burnt and total steps taken?

This will help identify whether the steps users track correlate with the calories-burnt data produced by Fitbit devices.

#Correlation between total daily calories and steps
ggplot(Daily_data, aes(x=Calories, y=TotalSteps))+
  geom_jitter() +
  geom_smooth(color = "blue") + 
  labs(title = "Daily calories burnt vs Total Steps", x = "Calories burnt", y= "Daily Steps") +
  theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the visualization, we can see that there is a positive correlation between the calories burnt and the steps taken by a user. The more steps a user takes, the more calories they burn.
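A correlation coefficient can put a number on what the scatter plot shows. The sketch below computes Pearson's r with base R's `cor()` on toy values; the figures are made up, so the resulting coefficient says nothing about the real dataset.

```r
# Toy daily records (illustrative values, not the real data).
toy <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000),
  Calories = c(1700, 1900, 2150, 2300, 2600)
)

# Pearson's r ranges from -1 to 1; values near 1 indicate a strong
# positive linear relationship.
r <- cor(toy$TotalSteps, toy$Calories, method = "pearson")
print(round(r, 3))
```

Running the same call on `Daily_data$TotalSteps` and `Daily_data$Calories` would give the coefficient for the actual data.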

Insight 6: Is there a correlation between the total steps and minutes of sleep?

This will help identify if sleeping habits affect the activity of a user during the day.

#Correlation between Steps and Total Minutes Asleep
ggplot(Daily_data, aes(x=TotalSteps, y=TotalMinutesAsleep))+
  geom_jitter() +
  geom_smooth(color = "green") + 
  labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y= "Minutes asleep") +
  theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the visualization, we can note that there is no clear correlation between the minutes of sleep and daily steps. We cannot establish that taking more steps assures a user a good night’s sleep.
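A visual impression of "no correlation" can be checked with a significance test. The sketch below uses base R's `cor.test()` on made-up values chosen to show a weak relationship; it illustrates the approach only and is not a result from the Fitbit data.

```r
# Toy daily records with little relationship between steps and sleep
# (illustrative values, not the real data).
toy <- data.frame(
  TotalSteps = c(3000, 5000, 7000, 9000, 11000, 13000),
  TotalMinutesAsleep = c(420, 390, 450, 400, 440, 410)
)

# cor.test() returns the Pearson estimate together with a p-value for the
# null hypothesis that the true correlation is zero.
result <- cor.test(toy$TotalSteps, toy$TotalMinutesAsleep, method = "pearson")
estimate <- unname(result$estimate)
p_value <- result$p.value
print(estimate)
print(p_value)
```

A small estimate together with a large p-value means the data give no evidence of a linear relationship, which is a more cautious statement than saying there is none.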

Step 5: Act

Based on the insights derived from the sample Fitbit datasets, I would recommend the following to the stakeholders in order to boost the use of the Bellabeat app and to support the marketing of Leaf, Time and Spring.

1) Increase the intensity of online marketing from 6 am - 10 am, 2 pm - 3 pm, and 8 pm - 11 pm. Scheduling adverts on media channels during these hours will help users think of using Bellabeat products when they are more active.

2) Introduce step notifications in the Bellabeat app to notify users at 12 pm, 3 pm and 9 pm of their step count. This will help users monitor their step count and plan their daily activities.

3) Introduce sleep notifications to help users avoid using their phones when they are in bed. This will help them get adequate sleep during the night.

4) Increase marketing campaigns that target light and sedentary users by advertising how Bellabeat can help them meet their fitness goals based on the activity logged and how it corresponds to calories burnt.

5) Introduce a reward system to encourage users to log their activity in the Bellabeat app. We can also introduce them to other Bellabeat products that automatically log the information for them without the need to enter it manually.

---
title: "Bellabeat Case Study"
output: html_notebook
---

# **Google Data Analytics Professional Capstone Project - BellaBeat**

Bellabeat is a high-tech manufacturer of health-focused products for women. In order to answer the key business questions, I followed the steps of the data analysis process: **ask**, **prepare**, **process**, **analyze**, **share**, and **act**.

I acted as a junior data analyst working on the marketing analyst team at Bellabeat. The Chief Creative Officer (CCO) of Bellabeat believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. As part of the data analytics team, I was asked to focus on one of Bellabeat's products and analyze smart device usage data to gain insight into how consumers are already using their smart devices. The insights I discover will then help guide the company's marketing strategy. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for how these trends can inform Bellabeat's marketing strategy.

## **Step 1: Ask**

In my hypothetical role as the junior data analyst for Bellabeat's marketing team, the CCO asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation. These questions guided my analysis:

1.  What are some trends in smart device usage?
2.  How could these trends apply to Bellabeat customers?
3.  How could these trends help influence Bellabeat marketing strategy?

As part of the ask phase of the data analysis, I identified the key task of the analysis and the stakeholders to whom I would make recommendations based on the insights derived from the data.

Key tasks

1\. Identify the business task: The main business task is to use smart device data to analyse how users use smart devices to monitor their activity.

2\. Consider key stakeholders: The main stakeholders of the analysis are the executive team, which comprises the CCO and the co-founders; the customer-facing marketing team; and the analytics team carrying out the analysis.

**Deliverable:  A clear statement of the business task**

The main deliverable of the Ask Phase is a clear business task which I documented as below:

Bellabeat would like to understand how users use lifestyle smart devices. To do so, we will analyse data collected from smart device users who do not use Bellabeat products and identify trends in how they use their devices. This will help shape Bellabeat's marketing strategy so that it can advertise its products to more users.

## **Step 2: Prepare**

The CCO encourages the use of public data that explores smart device users' daily habits. She points us to a specific data set:

● [**FitBit Fitness Tracker Data**]{.ul} (CC0: Public Domain, dataset made available through [Mobius]{.ul}): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits.

The CCO tells us that this data set might have some limitations, and encourages us to consider adding another data set to help address those limitations as we begin to work more with this data.

The guiding questions in the preparation of the data are as below:

● **Where is your data stored?**

The data is stored on Kaggle as part of the FitBit Fitness Tracker dataset made available through Mobius. The data set contains personal fitness tracker data from thirty Fitbit users who consented to the collection of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits.

● **How is the data organized? Is it in long or wide format?**

The data is stored in CSV files, which are organised into minute, hourly and daily measurements of calories burned, steps, sleep and activity intensity.

● **Are there issues with bias or credibility in this data? Does your data ROCCC?**

To assess the credibility of the data, I followed Google's ROCCC criteria. For each of the datasets I determined whether the data is:

1.  Reliable: The dataset is reliable for the analysis as it contains fitness data relating to sleep, steps taken, calories burned and heart rate. The data helps us gain insight into how users use their Fitbits.
2.  Original: The data was collected by fitness tracker users.
3.  Comprehensive: The data is organised into minute, hourly and daily measurements, which makes it easy to understand.
4.  Current: The dataset covers user activity over two months in 2016, which I treated as current for this analysis.
5.  Cited: The data is not cited; it is, however, available on Kaggle.

● **How are you addressing licensing, privacy, security, and accessibility?**

The data was anonymised with no PII included in the data. The data was classified using a unique ID that was used to link the different datasets.

● **How did you verify the data's integrity?**

The data can be said to have integrity as it is publicly available on Kaggle, a public platform for data. To check the integrity of the datasets, I confirmed their entity and referential integrity. The data had unique Ids that corresponded to the users who submitted the data, and these Ids were consistent across all the datasets.

● **How does it help you answer your question?**

The dataset contains information collected from 30 Fitbit users over two months. This data will help me identify trends in the use of Fitbit trackers and so answer the questions in the business task.

● **Are there any problems with the data?**

The data has date columns that are stored as strings. In order to properly analyse the data, I converted all date columns to a datetime format in R using the `as.POSIXct` function.

## **Step 3: Process**

The process phase is to ensure the data is complete and can be used in the analyze phase. In order to process the data, I used the following guiding questions:

Guiding questions

● **What tools are you choosing and why?**

R: I am using R due to its capabilities in cleaning data, joining datasets and visualizing data to get key insights that solve the business task.

● **Have you ensured your data's integrity?**

I ensured that the data had entity and referential integrity by confirming all user Ids were consistent in all datasets.

● **What steps have you taken to ensure that your data is clean?**

To ensure my data is clean, I followed the steps below:

1\) Removed duplicate rows

2\) Removed rows that contain null values

3\) Ensured the datetime columns are properly formatted.

**● How can you verify that your data is clean and ready to analyze?**
To verify that the data was ready to use, I analysed the datasets' summary statistics in R. The summary shows whether there are any nulls in the data and reports the min, max, mean and median of the numerical variables, giving a quick overview of the data.
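
A minimal sketch of such a verification, using a made-up data frame (`clean_check` and its values are hypothetical) and base R only:

```r
# Hypothetical data frame with one missing value
clean_check <- data.frame(
  Id = c(1, 1, 2),
  TotalSteps = c(5000, 7200, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(clean_check))

# Number of fully duplicated rows
duplicate_rows <- sum(duplicated(clean_check))

na_per_column
duplicate_rows
summary(clean_check$TotalSteps)  # min, quartiles, median, mean, max and NA count
```

After cleaning, both counts would be expected to be zero for every dataset.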

**● Have you documented your cleaning process so you can review and share those results?**

The clean-up was documented in an R Markdown file that contains both the code and its output. This ensures that anyone can access the data and the clean-up code, and can replicate the process that was followed to clean the data. It also supports the credibility of the data by enabling review of the work done.

The following is the code used in loading and cleaning the data:

1\) Load libraries needed to clean the data.

```{r}
#Libraries used
library(dplyr)
library(tidyr)
library(lubridate)
```

2\) Prepare the data

Load the data used for analysis. The data is classified as daily and hourly.

```{r}
#Daily Data
daily_calories<-read.csv("Bellabeat/dailyCalories_merged.csv")
daily_activity<-read.csv("Bellabeat/dailyActivity_merged.csv")
daily_steps<-read.csv("Bellabeat/dailySteps_merged.csv")
daily_intensities<-read.csv("Bellabeat/dailyIntensities_merged.csv")
dailysleep<-read.csv("Bellabeat/sleepDay_merged.csv")
```

```{r}
#Hourly data
hourly_calories<-read.csv("Bellabeat/hourlyCalories_merged.csv")
hourly_steps<-read.csv("Bellabeat/hourlySteps_merged.csv")
hourly_intensities<-read.csv("Bellabeat/hourlyIntensities_merged.csv")
```

3\) Clean the data

```{r}
#Daily Activity Data
#1)Remove rows with nulls
daily_activity <- na.omit(daily_activity)
#2)Change date format
daily_activity$ActivityDate <- as.POSIXct(daily_activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
#3)Remove duplicate rows (compare whole rows and assign the result so the change is kept)
daily_activity <- daily_activity %>% distinct()
print(head(daily_activity))
```

Get a summary of the data

```{r}
#Get a summary of the data
summary(daily_activity)

```

```{r}
#Daily Sleep
#Get the description of the dataframe.
str(dailysleep)
#1)Remove rows with nulls
dailysleep<-na.omit(dailysleep)
#2)Remove duplicate rows (compare whole rows and assign the result so the change is kept)
dailysleep <- dailysleep %>% distinct()
#3)Clean date format
dailysleep[['SleepDay']] <- as.POSIXct(dailysleep$SleepDay, format="%m/%d/%Y", tz=Sys.timezone())
#Rename
names(dailysleep)[names(dailysleep) == 'SleepDay'] <- "ActivityDate"
print(head(dailysleep))
```

```{r}
#Hourly calories
str(hourly_calories)
hourly_calories$ActivityHour <- as.POSIXct(hourly_calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_calories$Time<-format(hourly_calories$ActivityHour, format = "%H:%M:%S")
print(head(hourly_calories))
```

```{r}
#Hourly Steps
str(hourly_steps)
hourly_steps$ActivityHour <- as.POSIXct(hourly_steps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_steps$Time<-format(hourly_steps$ActivityHour, format = "%H:%M:%S")
print(head(hourly_steps))
```

```{r}
#Merge the daily activity dataset with the sleep dataset
Daily_data<-inner_join(daily_activity, dailysleep, by=c("Id","ActivityDate"))
Daily_data$month <-months(Daily_data$ActivityDate)
Daily_data$weekday <-weekdays(Daily_data$ActivityDate)
print(head(Daily_data))

```
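
Because an inner join keeps only the days that appear in both tables, it can be worth checking how many activity rows have no matching sleep record. The sketch below shows one way to do this on made-up data, assuming the dplyr package is installed; the `activity` and `sleep` data frames and their values are hypothetical.

```r
library(dplyr)

# Hypothetical daily tables sharing the Id and ActivityDate keys
activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(8000, 6500, 12000)
)
sleep <- data.frame(
  Id = 1,
  ActivityDate = as.Date("2016-04-12"),
  TotalMinutesAsleep = 400
)

# Rows present in both tables
joined <- inner_join(activity, sleep, by = c("Id", "ActivityDate"))

# Activity rows without a sleep record, which the inner join drops
unmatched <- anti_join(activity, sleep, by = c("Id", "ActivityDate"))

nrow(joined)
nrow(unmatched)
```

A large number of unmatched rows would mean that results based on the merged data describe only the days on which users also tracked their sleep.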

## Step 4: Analyse and Share

To visualize the data, I first analysed the datasets with the business task in mind. I subset the datasets to gain more insights from the data and identify trends that can be used to solve the business task.

### Insight 1: How many calories are users burning per hour?

This will help us get insights on when the users are most active during the day.

```{r}
#Libraries
library(ggplot2)
library(hrbrthemes)
#Preparation of dataframe: total calories for each hour of the day
hourly_cal_new <- hourly_calories %>%
  drop_na() %>%
  group_by(Time) %>%
  summarise(total_hourly_calories = sum(Calories))

#Bar chart of total calories burnt per hour
#geom_col() draws bars from the values already computed above
ggplot(data=hourly_cal_new, aes(x=Time, y=total_hourly_calories)) +
  geom_col(fill='darkblue') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Total Calories vs. Time")
```

From the visualization, we can note that users are active from 5 am to 11 pm. The peak of activity is from 4 pm to 7 pm.

### Insight 2: How many steps are taken per hour?

This will help us get insights on when the users are most active during the day and when they are out and about getting exercise.

```{r}
#Line plot of avg steps per hour
# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)

#Average steps for each hour of the day
hourly_steps_new <- hourly_steps %>%
  drop_na() %>%
  group_by(Time) %>%
  summarise(mean_hourly_steps = mean(StepTotal))
#print(hourly_steps_new)

# Plot (group=1 joins the points into one line, since Time is a text column)
hourly_steps_new %>%
  ggplot( aes(x=Time, y=mean_hourly_steps, group=1)) +
  geom_line( color="grey") +
  geom_point(shape=21, color="black", fill="#69b3a2", size=2) +
  theme_ipsum() +
  ggtitle("Average Steps per hour")
```

From the visualization, we note that users are active in recording their steps from 6 am to 7 pm. The peak of activity is between 3 pm and 6 pm.

### Insight 3: How long are people winding down in order to get to sleep?

This will help analyse the sleep patterns of the users.

```{r}
#Lollipop graph - total time in bed vs time asleep
weekday_sleep <- Daily_data %>%
  group_by(weekday) %>%
  drop_na() %>%
  summarise(total_min_asleep = sum(TotalMinutesAsleep),
            total_min_in_bed = sum(TotalTimeInBed))
print(weekday_sleep)

# Plot
ggplot(weekday_sleep) +
  geom_segment( aes(x=weekday, xend=weekday, y=total_min_asleep, yend=total_min_in_bed), color="grey") +
  geom_point( aes(x=weekday, y=total_min_asleep), color=rgb(0.2,0.7,0.1,0.5), size=3 ) +
  geom_point( aes(x=weekday, y=total_min_in_bed), color=rgb(0.7,0.2,0.1,0.5), size=3 ) +
  coord_flip()+
  theme_ipsum() +
  theme(legend.position = "left") +
  xlab("Weekday") +
  ylab("Total Minutes")+
  ggtitle("Time in Bed vs Time Asleep per weekday")
```

From the lollipop graph, we can see that users sleep least on Mondays, and that the gap between time in bed and time asleep is largest on Sundays, which suggests they take longer to fall asleep that night. This is consistent with what is often called Sunday insomnia.
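
Weekday totals depend on how many records each weekday has, so a per-record mean may give a fairer comparison between days. The following is a rough sketch of that alternative on made-up data, using base R's aggregate() rather than the dplyr pipeline above; `sleep_toy` and its values are hypothetical, with column names mirroring `Daily_data`.

```r
# Hypothetical sleep records
sleep_toy <- data.frame(
  weekday = c("Monday", "Monday", "Sunday", "Sunday", "Sunday"),
  TotalMinutesAsleep = c(400, 420, 430, 410, 450),
  TotalTimeInBed     = c(430, 450, 480, 470, 500)
)

# Order weekdays chronologically instead of alphabetically
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
sleep_toy$weekday <- factor(sleep_toy$weekday, levels = day_levels)

# Mean minutes per record for each weekday
weekday_means <- aggregate(
  cbind(TotalMinutesAsleep, TotalTimeInBed) ~ weekday,
  data = sleep_toy, FUN = mean
)

# Average minutes spent in bed but not asleep
weekday_means$awake_in_bed <-
  weekday_means$TotalTimeInBed - weekday_means$TotalMinutesAsleep

weekday_means
```

Storing the weekday as an ordered factor also makes plots list the days in calendar order.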

### Insight 4: How can we classify users of Fitbit applications?

This will show the percentage of users who rely on Fitbit apps and devices to track their activity during the day.

```{r}
#users
daily_average <- Daily_data %>%
  group_by(Id) %>%
  summarise (mean_daily_steps = mean(TotalSteps),
             mean_daily_calories = mean(Calories),
             mean_distance = mean(TotalDistance),
             mean_daily_sleep = mean(TotalMinutesAsleep))

head(daily_average)
user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "Sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "Lightly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "Moderately active", 
    mean_daily_steps >= 10000 ~ "Very active"
  ))

head(user_type)
```

```{r}
#pie chart
library(plotly)
user_types<-c("Sedentary users", "Lightly active users", "Moderately active users", "Very active users")
sedentary<-nrow(user_type[user_type$user_type == "Sedentary",])
light<-nrow(user_type[user_type$user_type == "Lightly active",])
moderate<-nrow(user_type[user_type$user_type == "Moderately active",])
active<-nrow(user_type[user_type$user_type == "Very active",])
count_users<-c(sedentary, light, moderate, active)
user_types_df<-data.frame(user_types,count_users)

fig <- plot_ly(user_types_df, labels = ~user_types, values = ~count_users, type = 'pie')
fig <- fig %>% layout(title = 'Classification of Users based on mean daily steps',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig
```

From the visualization, we identify that the majority of the users are moderately active. This suggests that users are making moderate use of Fitbit applications to monitor their activity during the day.
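
Per-user means are rarely whole numbers, so the step bands should leave no gaps between them. As a rough sketch, the same bands can be expressed with base R's cut(), used here in place of case_when(), and the share of each group computed directly; the `mean_daily_steps` values below are made up for illustration.

```r
# Hypothetical per-user mean daily steps, including values close to the band edges
mean_daily_steps <- c(3200, 7499.5, 8100, 9999.9, 12000)

# Bands are [lower, upper), so 7500 itself falls in "Moderately active"
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Lightly active", "Moderately active", "Very active"),
  right = FALSE
)

# Number and percentage of users in each band
counts <- table(user_type)
shares <- round(100 * prop.table(counts), 1)

counts
shares
```

Because the breaks are contiguous, every user is assigned to exactly one band and none is left unclassified.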

### Insight 5: Are there correlations between total calories burnt and total steps taken?

This will help identify whether the steps users track correlate with the calories-burnt data produced by Fitbit devices.

```{r}
#Correlation between total daily calories and steps
ggplot(Daily_data, aes(x=Calories, y=TotalSteps))+
  geom_jitter() +
  geom_smooth(color = "blue") + 
  labs(title = "Daily calories burnt vs Total Steps", x = "Calories burnt", y= "Daily Steps") +
  theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
```

From the visualization, we can see that there is a positive correlation between the calories burnt and the steps taken by a user. The more steps a user takes, the more calories they burn.
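
The strength of such a relationship can also be expressed as a number. A minimal sketch using cor() on made-up values (the `steps` and `calories` vectors are hypothetical, not taken from the dataset):

```r
# Hypothetical daily values for five days
steps    <- c(2000, 4000, 6000, 8000, 10000)
calories <- c(1700, 1850, 2100, 2200, 2500)

# Pearson correlation coefficient: values near 1 indicate a strong positive linear relationship
r <- cor(steps, calories, method = "pearson")
r
```

On the real data, the same call could be applied to the TotalSteps and Calories columns of Daily_data.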

### Insight 6: Is there a correlation between the total steps and minutes of sleep?

This will help identify if sleeping habits affect the activity of a user during the day.

```{r}
#Correlation between Steps and Total Minutes Asleep
ggplot(Daily_data, aes(x=TotalSteps, y=TotalMinutesAsleep))+
  geom_jitter() +
  geom_smooth(color = "green") + 
  labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y= "Minutes asleep") +
  theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
```

From the visualization, we can note that there is no clear correlation between daily steps and minutes of sleep. We cannot establish from this data whether taking more steps leads to a better night's sleep.
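
A visual impression of "no correlation" can be backed up with a significance test. The sketch below runs cor.test() on made-up values (the `steps` and `asleep` vectors are hypothetical); it illustrates the approach and is not a result from the dataset.

```r
# Hypothetical daily values for six days
steps  <- c(3000, 5000, 7000, 9000, 11000, 13000)
asleep <- c(420, 380, 450, 400, 430, 410)

# Pearson correlation with a test of the null hypothesis that the true correlation is zero
result <- cor.test(steps, asleep, method = "pearson")

result$estimate  # correlation coefficient
result$p.value   # a large p-value means the data give no evidence of a linear relationship
```

A coefficient close to zero together with a large p-value would support the reading of the scatter plot.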

## Step 5: Act

From the insights derived from the sample Fitbit datasets, I would recommend the following to the stakeholders to boost the use of the BellaBeat app and, in turn, support the marketing of Leaf, Time and Spring.

1\) Increase the intensity of online marketing from 6 am - 10 am, 2 pm - 3 pm, and 8 pm - 11 pm. Scheduling adverts on media channels during these hours will help users think of using BellaBeat products when they are more active.

2\) Introduce step notifications in the BellaBeat App to notify users of their step count at 12 pm, 3 pm and 9 pm. This will help users monitor their step count and plan their daily activities.

3\) Introduce sleep notifications to remind users not to use their phones when they are in bed. This will help them get adequate sleep during the night.

4\) Increase marketing campaigns that target light and sedentary users by advertising how BellaBeat can help them meet their fitness goals based on the activity logged and how it corresponds to calories burnt.

5\) Introduce a reward system to encourage users to log their activity in the BellaBeat App. We can also introduce them to other BellaBeat products that log the information automatically, without the need to enter it manually.
